Using dynamic microsimulation to project cognitive function in the elderly population

Background A long-term projection model that is based on nationally representative data and tracks disease progression across the Alzheimer’s disease continuum is important for the economic evaluation of therapies for Alzheimer’s disease and other dementias (ADOD). Methods The Health and Retirement Study (HRS) includes an adapted version of the Telephone Interview for Cognitive Status (TICS27) to evaluate respondents’ cognitive function. We developed an ordered probit transition model to predict future TICS27 scores. This transition model is used in the Future Elderly Model (FEM), a dynamic microsimulation model of health and health-related economic outcomes for the US population. We validated the FEM TICS27 model using a five-fold cross-validation approach, comparing 10-year (2006–2016) simulated outcomes against observed HRS data. Results In aggregate, the distribution of TICS27 scores after ten years of FEM simulation matches that in the HRS. FEM’s assignment of cognitive/mortality status also matches that observed in the HRS at the population level. At the individual level, the area under the receiver operating characteristic curve (AUROC) is 0.904 for predicting dementia or death with dementia in 10 years, and 0.722 for predicting significant cognitive decline in two years among patients with mild cognitive impairment. Conclusions The FEM TICS27 model demonstrates predictive accuracy for both two- and ten-year cognitive outcomes. Our cognition projection model is unique in being validated with an unbiased approach, which makes it a high-quality platform for assessing the burden of cognitive decline and translating the benefit of innovative therapies into long-term value to society.


Introduction
Alzheimer's disease and other dementias (ADOD) impose an increasing burden on US society and the US health care system. According to the Alzheimer's Association, the number of Americans aged 65 and older living with Alzheimer's dementia is estimated to grow from 5.8 million in 2019 to 13.8 million by 2050 as the baby boom generation ages [1]. At the same time, the US Food and Drug Administration recently approved Aduhelm (aducanumab), the first disease-modifying treatment (DMT) for ADOD, with more potential DMTs in the pipeline [2,3]. The approval of Aduhelm has prompted discussion about these therapies' real-world value and their potential costs to the health care system. As the focus of treatment shifts from dementia to earlier disease phases such as mild cognitive impairment (MCI), concerns are growing about high upfront costs for initial screening and diagnostics combined with benefits that are late-occurring and uncertain [4,5].
Facing great opportunities and challenges in ADOD therapy development, we need proper analysis tools to evaluate the economic impact of cognitive impairment and potential therapies. A cognitive model for projecting the US population should model all stages of cognitive decline for a nationally representative sample. Additionally, incorporating predictors and risk factors makes the model useful for assessing counterfactual scenarios and interventions. We identified six ADOD economic evaluation models for the US in the literature; none of them was based on data nationally representative of the US population [6–11]. Four models used the Uniform Data Set from the US National Alzheimer's Coordinating Center [6,8,10,11]. The Uniform Data Set contains data from Alzheimer disease centers across the United States, but it is not considered a population-based sample because participating Alzheimer disease centers do not enroll patients at random [6]. One model was based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, a research cohort of participants who are cognitively normal or have MCI or dementia; it is also not commonly regarded as typical of the current clinical ADOD population [7,10]. One model was not based on a selected sample of participants but used model input parameters from multiple data sources [9]. Of the six US models, three tracked individuals across the full Alzheimer's disease continuum [7,9,11]. We also identified three models for other countries that tracked individuals across the full Alzheimer's disease continuum [12–14]. Of these, one was based on six longitudinal cohort studies from different countries, while one model for the UK and one for Spain used model input parameters from multiple data sources.
The Spain model represented the Spanish population aged 40 years or older from 2010 to 2050 and was validated against published life expectancy and incidence and prevalence of the dementia stages in Spain [13].
Among the ADOD economic evaluation models we identified, only one reported rigorous validation. The Alzheimer's Disease Archimedes Condition-Event Simulator was validated by comparing its predicted risks of mortality, institutionalization, and transition to Alzheimer's dementia with external data from patient registries, clinical trials and the literature. In each validation, the baseline characteristics of the simulation cohort were matched to those of the study population in the external data [7]. Several other models compared their estimated transition probabilities, cognitive trajectories, dementia incidence and survival with published literature [6,8–11,13]. However, none of the ADOD economic evaluation models we identified validated its simulated output against population-based samples. Model validation is a general challenge in the area of ADOD, and many models share this limitation because of the lack of publicly available data [10].
In this paper, we introduce and validate FEM TICS27, a microsimulation model that projects trajectories in cognitive test scores across the full Alzheimer's disease continuum, based on

b. Data and measures
FEM uses data from the HRS, a biennial, nationally representative longitudinal survey of more than 37,000 respondents over age 50 in the US. Baseline interviews with existing birth cohorts were conducted in 1992, 1993, 1998, 2004, 2010, and 2016, with oversampling of Hispanics and African Americans. Every six years, the HRS enrolls a new birth cohort so that the sample remains representative of the US population over age 50. Participants are followed through the life course with core biennial surveys and supplemental data collections. Technical details on HRS sampling design, recruitment, and measurement have been published elsewhere [19]. In this validation study, our simulation sample consisted of respondents aged 53 or older in the 2006 HRS survey. All population-level analyses used HRS sample weights.
With its goal of understanding the challenges and opportunities of aging, the HRS includes a section on cognition, since decline in cognitive functioning is a hallmark of aging and predictive of mortality [20]. The HRS uses two different sets of measures to assess cognitive status. For respondents who complete the survey themselves ("self-respondents"), cognitive functioning is assessed using an adapted version of the Telephone Interview for Cognitive Status (TICS) [20,21]. For respondents who are not able to complete the survey themselves, proxy respondents are asked about changes in the respondent's memory over the last two years.
TICS is modeled after the Mini-Mental State Exam (MMSE) for use over the telephone, and TICS scores can be converted to MMSE scores using a validated crosswalk [20,22]. TICS assesses cognitive impairment and dementia status with test items that evaluate memory, concentration and executive function, such as immediate and delayed word recall, serial subtraction of 7 from 100, and counting backward from 20. Composite scores from these items create a measure of cognitive functioning ranging from 0 to 27 (TICS27) [20,23]. Respondents with scores from 0 to 6 are classified as having dementia, from 7 to 11 as having MCI, and from 12 to 27 as being cognitively normal. This approach was developed and validated by Langa and Weir (2010) [24] and has since been used in many studies of cognitive functioning [23,25,26]. To reduce measurement error in categorizing cognitive status based on TICS27, we require two consecutive responses for dementia: one wave with dementia followed by either dementia or death in the next wave. For MCI, we require either one wave with MCI followed by MCI, dementia or death, or one wave with dementia followed by MCI. All other cases are categorized as cognitively normal. Cognitive status is treated as an absorbing state: once a respondent has been classified with "verified" dementia or MCI, we assume their cognitive status cannot revert to a less severe state.
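The cut points and two-wave verification rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the wave encoding (a list of TICS27 scores optionally ending with a hypothetical `DEAD` marker), the function names, and the handling of the last observed wave (which has no following wave and is therefore treated as unverified) are our assumptions.

```python
from typing import List, Optional, Union

DEAD = "dead"  # hypothetical marker for a wave at which the respondent has died
SEVERITY = {"normal": 0, "mci": 1, "dementia": 2}


def raw_status(score: int) -> str:
    """Classify one wave's TICS27 score with the 0-6 / 7-11 / 12-27 cut points."""
    if not 0 <= score <= 27:
        raise ValueError(f"TICS27 score out of range: {score}")
    if score <= 6:
        return "dementia"
    if score <= 11:
        return "mci"
    return "normal"


def verified_statuses(waves: List[Union[int, str]]) -> List[str]:
    """Return the verified cognitive status for each wave the respondent is alive.

    `waves` holds TICS27 scores in wave order, optionally ending with DEAD.
    """
    statuses: List[str] = []
    current = "normal"
    for t, obs in enumerate(waves):
        if obs == DEAD:
            break
        raw = raw_status(obs)
        # Category (or death) observed in the following wave; None if there is none.
        nxt = waves[t + 1] if t + 1 < len(waves) else None
        nxt_cat: Optional[str] = DEAD if nxt == DEAD else (
            raw_status(nxt) if nxt is not None else None)
        if raw == "dementia" and nxt_cat in ("dementia", DEAD):
            wave_status = "dementia"
        elif raw == "mci" and nxt_cat in ("mci", "dementia", DEAD):
            wave_status = "mci"
        elif raw == "dementia" and nxt_cat == "mci":
            wave_status = "mci"
        else:
            wave_status = "normal"  # unverified, including the last observed wave
        # Absorbing state: never revert to a less severe verified status.
        if SEVERITY[wave_status] > SEVERITY[current]:
            current = wave_status
        statuses.append(current)
    return statuses
```

For example, a respondent scoring 9 and then 10 is verified as MCI at the first wave and stays MCI even if a later score returns to the normal range, because of the absorbing-state assumption.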
Our model's target, the TICS27 score, is missing for all respondents who use a proxy respondent in the HRS. Some respondents cannot participate in the interview because of cognitive problems; others might choose to use a proxy because they were working and were thus more likely to be cognitively normal. For example, in 2016, among the 941 proxy respondents, 41.9% did not think the respondent had any cognitive limitations, 6.2% thought the respondent might have some cognitive limitations, and 52.0% thought the respondent had cognitive limitations that prevented him or her from being interviewed. For detailed proxy interview cognitive impairment ratings from the 2006–2016 HRS, see Technical Table A-1 in S1 Appendix. We therefore assumed that the missingness of TICS27 among respondents using a proxy is correlated with their cognitive functioning, depending on the reason for using a proxy. Since this missingness is not at random, and the HRS does not provide imputed TICS27 values for respondents with a proxy, we adopted a multiple imputation strategy based on the HRS approach for missing TICS27 among self-respondents. We performed the imputation using the sequential regression method, with a combination of relevant demographic, health, and economic variables as well as prior-wave cognitive variables [27]. The multiple imputation was performed using the multiple imputation (mi) command in Stata Version 16.0. Following HRS practice, we did not impute for participants who did not respond to the survey in a given wave.
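The core idea of multiple imputation can be illustrated with a minimal Python sketch. It assumes a single incomplete variable (TICS27) and fully observed covariates, in which case sequential regression reduces to repeatedly drawing from one regression model. It uses a linear-normal model with posterior parameter draws and pools a mean with Rubin's rules. This is not the HRS procedure or Stata's `mi` implementation; the function names, the linear model, and the rounding and clipping to the 0 to 27 range are our simplifying assumptions.

```python
import numpy as np


def impute_tics27(X, y, n_imputations=5, seed=0):
    """Return n_imputations completed copies of y via Bayesian linear-regression imputation.

    X: (n, k) fully observed covariates, including an intercept column.
    y: (n,) TICS27 scores with np.nan where the score is missing (e.g. proxy interviews).
    """
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    n_obs, k = Xo.shape
    xtx_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = xtx_inv @ Xo.T @ yo  # OLS fit on respondents with an observed score
    rss = float(np.sum((yo - Xo @ beta_hat) ** 2))
    n_mis = int((~obs).sum())
    completed = []
    for _ in range(n_imputations):
        # Draw sigma^2 and beta from their posterior so imputations reflect parameter uncertainty.
        sigma2 = rss / rng.chisquare(n_obs - k)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # Predict the missing scores and add residual noise.
        draw = X[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n_mis)
        y_full = y.copy()
        # Round and clip to the TICS27 range (a simplification of an ordinal outcome).
        y_full[~obs] = np.clip(np.rint(draw), 0, 27)
        completed.append(y_full)
    return completed


def pool_mean(completed):
    """Combine per-imputation means with Rubin's rules; returns (estimate, total variance)."""
    m = len(completed)
    means = np.array([c.mean() for c in completed])
    within = np.mean([c.var(ddof=1) / c.size for c in completed])
    between = means.var(ddof=1)
    return float(means.mean()), float(within + (1 + 1 / m) * between)
```

The between-imputation term in `pool_mean` is what carries the extra uncertainty from the missing scores into the pooled estimate.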

c. Key transition models
FEM transition models are a mixture of continuous, binary, and categorical outcomes, with a timescale that mimics the two-year structure of the HRS data. Table 1 shows the marginal effects of the two FEM transition models most relevant to this study, the TICS27 and mortality transition models; their coefficients are reported in Technical Table A-2 in S1 Appendix. The TICS27 transition model was estimated as an ordered probit model using HRS data from 2008 to 2016. We chose an ordered probit model because TICS27 is an ordinal score ranging from 0 to 27. As Table 1 shows, variables in the TICS27 model are the TICS27 scores from two and four years prior, cognitive status, demographic variables (age, gender, race and ethnicity, education), chronic disease indicators, employment, widowhood, smoking status and body mass index. This choice of variables aligns with evidence on risk factors for cognitive decline in the existing literature [28]. We also limited variables to information that is readily available from the survey, i.e., not based on blood samples, genetic data, or other clinical procedures. The mortality transition model was estimated as a probit model using HRS data from 2008 to 2016. Prior validation shows that FEM mortality projections are generally in line with observed mortality rates [29].
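To make the transition step concrete, the following minimal Python sketch shows how an ordered probit model turns a linear index into a probability for each TICS27 score, and how one two-year step could be simulated with a probit mortality draw. The cut points, the linear indices `xb_tics` and `zg_mort`, and the choice to draw mortality before the cognition transition are illustrative assumptions; the estimated coefficients are in Technical Table A-2, and FEM's actual implementation may differ.

```python
import math
import random
from typing import List, Optional, Sequence


def norm_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(xb: float, cuts: Sequence[float]) -> List[float]:
    """P(Y = j | x) = Phi(cut_j - xb) - Phi(cut_{j-1} - xb) for j = 0..len(cuts).

    Uses cut_{-1} = -inf and cut_{len(cuts)} = +inf, so 27 increasing cut
    points give probabilities for TICS27 = 0..27.
    """
    if any(b <= a for a, b in zip(cuts, cuts[1:])):
        raise ValueError("cut points must be strictly increasing")
    cdf = [norm_cdf(c - xb) for c in cuts] + [1.0]
    return [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]


def draw_category(probs: Sequence[float], rng: random.Random) -> int:
    """Inverse-CDF draw of a category index from a probability vector."""
    u = rng.random()
    acc = 0.0
    for j, p in enumerate(probs):
        acc += p
        if u < acc:
            return j
    return len(probs) - 1  # guards against floating-point shortfall in the running sum


def simulate_transition(xb_tics: float, cuts: Sequence[float],
                        zg_mort: float, rng: random.Random) -> Optional[int]:
    """One hypothetical two-year step: None if the person dies, else the new TICS27 score."""
    if rng.random() < norm_cdf(zg_mort):  # probit mortality: P(death) = Phi(z'gamma)
        return None
    return draw_category(ordered_probit_probs(xb_tics, cuts), rng)
```

Because the ordered probit shares one linear index across all categories, a covariate with a positive coefficient shifts probability mass toward higher scores for everyone, which keeps the number of estimated parameters small relative to a multinomial model with 28 outcomes.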

d. Model validation approach
We validated the FEM TICS27 model at both the population and individual levels. At the population level, we examined two outcomes: TICS27 distribution comparisons and 10-year changes in a composite measure of both cognition and mortality. At the individual level, we assessed FEM's performance in predicting dementia in 2, 4, 6, 8, and 10 years and significant decline in TICS27 in two years using receiver operating characteristic (ROC) curves.
We validated FEM's TICS27 model using a five-fold cross-validation approach, comparing 10-year simulated population- and individual-level outcomes against observed HRS data in 2016. All validation results are based on 500 Monte Carlo simulations of FEM. Five-fold cross-validation gives us separate datasets for estimation and simulation, so that the FEM TICS27 model's performance is evaluated in an independent dataset. Cross-validation is a data resampling method for assessing the generalizability of predictive models and preventing overfitting [30]. To do this, we first randomly partitioned our simulation sample into five complementary subsets. We then held out one subset for simulation and used the other four subsets to estimate the transition models for that simulation. We repeated this process five times so that each subset was used once for simulation. Finally, we pooled the results from the five subset simulations for the validation analyses. We chose five folds because it has been shown empirically that five- to ten-fold cross-validation is the optimal approach [31].
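The partition-estimate-simulate loop described above can be sketched in Python. This is a minimal illustration under stated assumptions: the sample is partitioned by respondent ID, and the hypothetical `estimate` and `simulate` callables stand in for FEM's transition-model estimation and microsimulation steps, which are not shown here. The seed is arbitrary.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence


def cross_validate(person_ids: Sequence[Hashable],
                   estimate: Callable[[List[Hashable]], object],
                   simulate: Callable[[List[Hashable], object], Dict[Hashable, object]],
                   k: int = 5, seed: int = 2006) -> Dict[Hashable, object]:
    """Hold out each of k folds once, estimate on the rest, and pool held-out simulations."""
    ids = list(person_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # k disjoint, complementary subsets
    pooled: Dict[Hashable, object] = {}
    for i, held_out in enumerate(folds):
        # Estimate transition models on the other k - 1 folds only.
        training = [pid for j, fold in enumerate(folds) if j != i for pid in fold]
        params = estimate(training)
        # Simulate only the held-out respondents with those estimates.
        pooled.update(simulate(held_out, params))
    return pooled
```

Since each respondent appears in exactly one held-out fold, the pooled output covers the whole sample, and every simulated outcome comes from models that never saw that respondent's data.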
Two different samples were used in the validation analyses. The first was the complete 2006 HRS sample, which included HRS respondents aged 53 or older in 2006; it was used in the population-level distribution comparisons. The second was the 2006 HRS full 10-year follow-up sample, a subset of the complete 2006 HRS sample, which was used for the population-level 10-year change in cognitive/mortality status and for the individual-level analyses. This full 10-year follow-up sample required individuals to respond to every wave of the HRS survey from 2006 until either 2016 or their death.
Population-level outcomes include TICS27 distribution comparisons and 10-year changes in a composite measure of both cognition and mortality. We adopted this composite measure because people with dementia have high mortality rates, so dementia cases would be undercounted if deaths were ignored. Individual-level outcomes include predicting dementia status in 10 years and predicting a decline of 3 or more TICS27 points within 2 years for patients with MCI.
At the population level, the distribution of simulated TICS27 scores in 2016 was compared with that of the 2016 HRS population in the same age range (aged 63 or older). We also analyzed the 10-year change in status by comparing the cognitive status or death that FEM assigned in 2016, given 2006 cognitive status, with the status observed in the HRS. Cognitive status at death was defined as the cognitive status in the last wave before death.
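The five-category 2016 outcome and its weighted distribution can be sketched in Python. This is a minimal illustration, assuming each person's record carries a death indicator, the cognitive status in 2016 (or in the last wave before death), and an HRS sample weight; the category labels and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CATEGORIES = ("normal", "mci", "dementia",
              "dead_without_dementia", "dead_with_dementia")


def composite_status(died: bool, last_cognitive_status: str) -> str:
    """Five-category 2016 outcome.

    `last_cognitive_status` is the 2016 status for survivors, or the status in
    the last wave before death for decedents.
    """
    if last_cognitive_status not in ("normal", "mci", "dementia"):
        raise ValueError(f"unknown cognitive status: {last_cognitive_status}")
    if not died:
        return last_cognitive_status
    if last_cognitive_status == "dementia":
        return "dead_with_dementia"
    return "dead_without_dementia"


def weighted_shares(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Weighted percentage of people in each category, from (category, sample weight) pairs."""
    totals: Dict[str, float] = defaultdict(float)
    for category, weight in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += weight
    grand = sum(totals.values())
    return {c: 100.0 * totals[c] / grand for c in CATEGORIES}
```

Running `weighted_shares` separately on simulated and observed records, within each 2006 starting status, yields the two distributions being compared.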
At the individual level, we assessed FEM's performance in predicting dementia in 10 years and significant decline in TICS27 in two years using receiver operating characteristic (ROC) curves. Though more commonly used for regression-based risk prediction models, ROC curves have also been used to validate other disease simulation models [15,32]. For the individual-level analyses, we ran FEM 500 times over a 10-year horizon for every individual in the 2006 HRS full 10-year follow-up sample. Across the 500 iterations, we calculated for each individual the percentage of iterations with each of two outcomes: (1) dementia in 2016, whether alive or dead with dementia; and (2) significant decline (a decline of 3 or more points) in TICS27 by 2008. Prior research found that a 3-point decline in MMSE indicates significant decline [33,34]. Using a crosswalk between MMSE and TICS27, a 3-point decline in MMSE corresponds to a 3-point decline in TICS27 for people with MCI (MMSE from 24 to 27) [35]. We then ranked every individual in the simulation by their FEM-based risk for each measure. These ranks were compared with actual outcomes in the HRS data to generate ROC curves. We used the area under the ROC curve (AUROC), a common measure of predictive model performance, to evaluate FEM's performance on these two measures.
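The risk-ranking and AUROC steps can be sketched in Python, assuming each individual's simulated outcomes across iterations are stored as booleans. The AUROC below uses the rank-sum (Mann-Whitney) formula, which equals the area under the empirical ROC curve; tied risks, which are common when risk is a share of 500 iterations, receive average ranks. Function names are hypothetical, and this is not the authors' Stata code.

```python
from typing import List, Sequence


def significant_decline(before: int, after: int, threshold: int = 3) -> bool:
    """True if the TICS27 score fell by at least `threshold` points."""
    return before - after >= threshold


def simulated_risk(outcome_draws: Sequence[Sequence[bool]]) -> List[float]:
    """Share of Monte Carlo iterations in which each individual has the outcome."""
    return [sum(draws) / len(draws) for draws in outcome_draws]


def auroc(risk: Sequence[float], observed: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum formula, with average ranks for ties."""
    pairs = sorted(zip(risk, observed))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for t in range(i, j + 1):
            ranks[t] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative outcome")
    rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

The result is the probability that a randomly chosen person who developed the outcome was assigned a higher simulated risk than a randomly chosen person who did not, with ties counted as one half.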
We also compared our model's performance with that of one of the best-performing models for predicting cognitive decline, COMPASS [36]. COMPASS used data from the ADNI database, with information on age, gender, education, APOE genotype and cognitive composite scores for memory and executive function, to predict changes in MMSE scores over 24 months. COMPASS was evaluated on predicting significant decline in MMSE scores (3 points) in MCI subjects over 2 years using AUROC. We compared FEM's performance in predicting significant TICS27 decline in MCI subjects over 2 years with that of COMPASS.
The University of Southern California IRB approved this research under UP-18-00776 ("Ensuing Access to Novel Alzheimer's and Dementia Treatments") on November 21, 2019. This is a retrospective study of de-identified, publicly available secondary data from the Health and Retirement Study. This study uses HRS Public Release data, which are fully anonymized before researchers can access them. Before each interview, HRS participants are provided with a written informed consent information document. At the start of each interview, all HRS participants are read a confidentiality statement and give oral consent by agreeing to do the interview. Their oral consent is documented in their answers to the survey questionnaire.

FEM is programmed in C++, SAS and Stata, and all validation analyses were performed using Stata Version 16.0.

Table 2. Since the full follow-up sample includes relatively more

Table 4 shows the 10-year change in the distribution of combined cognitive/mortality status given a respondent's initial status. The 2016 status has five categories: cognitively normal, MCI, dementia, dead without dementia, and dead with dementia. Overall, FEM assigns percentages of people to each cognitive/mortality category in 2016 that are similar to those in the HRS data. Of HRS respondents, 56.7% retained normal cognitive function between 2006 and 2016; FEM assigns 58.5% of respondents to this category. In the HRS in 2016, 9.9% of respondents were in the MCI category and 1.9% were in the dementia category; the corresponding FEM predictions are 9.0% and 2.5%. In the HRS in 2016, 27.0% of respondents were dead without dementia and 4.5% were dead with dementia; FEM predicts 25.5% and 4.5%, respectively. Table 5 shows AUROC results for FEM predicting (1) dementia or death with dementia and (2) dementia conditional on being alive in 10 years, for both the full follow-up sample and subpopulations (e.g., by race and ethnicity).
Fig 2 shows the ROC curves for the full follow-up sample (Panels A and B) and for individuals with MCI in the 2006 sample (Panels C and D). In the full follow-up sample, the AUROC is 0.904 for dementia or dead with dementia in 10 years and 0.868 for dementia conditional on being alive. FEM's performance in predicting MCI or worse is comparable to its performance in predicting dementia (Table 5). FEM's predictive performance is also comparable across subgroups defined by age, race and ethnicity, education and disease status. For people aged 65 years or older in 2006, the AUROC for dementia or dead with dementia is 0.875. For non-Hispanic Black and Hispanic people, the AUROCs for dementia or dead with dementia are 0.906 and 0.881, respectively, compared with 0.891 for non-Hispanic White people. For people without a high school degree, the AUROC for dementia or dead with dementia is 0.866, compared with 0.856 for people with a high school education and 0.880 for people with at least some college education. For people who had ever had a stroke before 2006, the AUROC for dementia or dead with dementia is 0.875. For people with MCI in 2006, the AUROC is 0.720 for dementia or dead with dementia and 0.705 for dementia conditional on being alive. Table 7 shows FEM's AUROC for predicting significant decline (3 or more points) in TICS27 in two years and compares it with the performance of the COMPASS model in MCI subjects. The AUROC for FEM in predicting significant decline in TICS27 in two years is 0.722 for people with MCI in 2006. FEM performs better than Base COMPASS (AUROC 0.641), a machine learning model that additionally uses APOE genotype information. Advanced COMPASS (AUROC 0.814) outperforms FEM, although Advanced COMPASS includes information not only on APOE genotype but also on neuropsychological tests and validated composite scores for memory and executive function.

Discussion
We extended the FEM microsimulation model to include a widely used cognitive test based on nationally representative HRS data, using individual-level information on demographics (age, gender, race and ethnicity, education), chronic disease indicators (heart disease, stroke, cancer, hypertension, diabetes, lung disease, heart attack), employment, smoking status, marital status and body mass index. The FEM TICS27 model can be used to estimate the future burden and long-term value of treatments of cognitive decline in the US. It also provides a 10-year risk score for dementia based on information attainable from a telephone-based survey.
To our knowledge, most disease simulation models for cognitive decline and dementia either are not validated or are not validated with an unbiased approach such as five-fold cross-validation. Given limited access to data and the use of different cognitive function tests, validation of modeling methods is a general challenge in the area of ADOD [10]. We are not aware of any data source other than the HRS that uses TICS27 to measure cognitive function and is available as an independent dataset for external validation. In this situation, adopting five-fold cross-validation improves on the validation practice of most existing economic evaluation models for ADOD. Using the same data for model estimation and validation can bias model performance estimates upward because of overfitting. Although k-fold cross-validation is one of the most widely used data resampling methods for estimating the true prediction error of models and tuning model parameters in risk prediction models, it is rarely used to validate disease simulation models. Cross-validation enables us to assess the generalizability of a model without a new independent dataset, which is critical for obtaining unbiased estimates of prediction performance [30,31]. The FEM TICS27 model demonstrates excellent internal validity: the TICS27 distribution and 10-year change in cognitive status generated by FEM simulation closely match observed HRS data, and in the full follow-up sample the AUROCs for dementia prediction exceed 0.85. For predicting significant decline in MCI patients, FEM's performance is comparable to that of one of the best-performing models reported in the literature [36].

PLOS ONE
FEM TICS27's performance on two individual-level outcomes, long-term prediction of dementia and short-term prediction of cognitive decline, is comparable to or exceeds that of existing models. Previously published studies reported AUROCs between 0.6 and 0.78 for predicting AD/dementia within 3 to 40 years [37], lower than the AUROC of 0.904 reported in this study for predicting dementia or dead with dementia in 10 years. For predicting significant decline in cognitive test scores in two years, FEM TICS27's performance is comparable to that of one of the best-performing models, COMPASS, which won the Dialogue for Reverse Engineering Assessments and Methods (DREAM) Alzheimer's Disease Big Data challenge. One drawback of COMPASS is that it requires detailed clinical information and APOE genotype, and it is based on a selective disease registry, the ADNI database [36]. FEM TICS27, on the other hand, relies solely on demographic and survey-derived variables and can provide nationally representative estimates. Thus, FEM TICS27 demonstrated predictive accuracy for both long-term dementia status and short-term cognitive decline. FEM's better performance relative to other models is likely because it uses information on individual characteristics and behavior, such as smoking, widowhood, and disease history.
On the other hand, Advanced COMPASS is better at predicting outcomes for people with MCI, which is especially hard because prognosis and disease progression vary with patient characteristics [38]. For this group, additional clinical and genotype information significantly improves prediction performance [36]. Adding genotype, blood-based biomarker variables, behavioral symptoms and medication history, which are available for a subsample of the HRS, could improve FEM TICS27's performance for people with MCI. Future applications of FEM TICS27 will also include analyses of differences in cognitive trajectories by education, initial cognitive status, and race and ethnicity. The model can also be implemented in microsimulations for other countries.
With Aduhelm approved as the first ADOD DMT and more DMTs in the development pipeline, the future looks promising. Though crucial, the availability of DMTs is only one step toward improving cognitive function in the elderly population. Demonstrating the value of treatment and identifying people at risk of cognitive impairment are also very important, and FEM microsimulation can help with both. Understanding the long-term impact of ADOD DMTs beyond direct medical expenditure is crucial to demonstrating their value [39]. Because randomized controlled trials can only generate short-term evidence on the efficacy of ADOD DMTs, projection models are needed to estimate their future benefits and demonstrate their long-term value. Based on nationally representative data and modeling a broad spectrum of cognitive functioning, FEM TICS27 is a useful tool for assessing the long-term impact of these changes on the US health care system. Besides modeling cognitive decline accurately, FEM tracks other relevant outcomes, such as functional limitations, physical health, formal and informal care utilization, nursing home residence, and medical care costs. FEM can therefore provide much-needed evidence on the long-term value of ADOD DMTs across a broad range of outcomes. The advantage of FEM TICS27 is its high prediction accuracy using only information from a telephone-based survey. We presented FEM TICS27's model structure, variables and data sources, and validated its simulation outcomes against observed HRS data. We showed that the FEM TICS27 model can accurately predict cognitive test scores across the full Alzheimer's disease continuum for a nationally representative sample over a 10-year period. This paper demonstrates FEM TICS27's usefulness as a model for long-term economic evaluation of ADOD.